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Abstract 

We propose a sequential learning policy for noisy discrete global optimization and 
ranking and selection (R&S) problems with high dimensional sparse belief functions, 
where there are hundreds or even thousands of features, but only a small portion of 
these features contain explanatory power. We aim to identify the sparsity pattern and 
select the best alternative before the finite budget is exhausted. We derive a knowl¬ 
edge gradient policy for sparse linear models (KGSpLin) with group Lasso penalty. This 
policy is a unique and novel hybrid of Bayesian R&S with frequentist learning. Partic¬ 
ularly, our method naturally combines B-spline basis expansion and generalizes to the 
nonparametric additive model (KGSpAM) and functional ANOVA model. Theoretically, 
we provide the estimation error bounds of the posterior mean estimate and the functional 
estimate. Controlled experiments show that the algorithm efficiently learns the correct 
set of nonzero parameters even when the model is imbedded with hundreds of dummy 
parameters. Also it outperforms the knowledge gradient for a linear model. 

Keywords: sequential decision analysis, sparse additive model, ranking and selection, 
knowledge gradient, functional ANOVA model 

1 Introduction 

The ranking and selection (R&S) problem arises when we are trying to hnd the best of 
a set of competing alternatives through a process of sequentially testing different choices, 
which we have to evaluate using noisy measurements. In specihc, we are maximizing an 
unknown function fi{x) : A i—)• M, where A C M"* is a finite set with M alternatives. We 
have the ability to sequentially choose a set of measurements to estimate. Our goal is to 
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select the best alternative when the finite budget is exhausted. We assume that experiments 
are time consuming and expensive. This problem arises in applications such as simulation 
optimization, medical diagnostics and the design of business processes. In such applications, 
the number of underlying parameters might be quite large; for example, we might have to 
choose a series of parameters to design a new material which might involve temperature, 
pressure, concentration and choice of component materials such as catalysts. 


The early R&S literature assumes a lookup table belief model 


Frazier et al. 


2003,120091], 


but recent research has used a parametric belief model, making it possible to represent 
many thousands or even millions of alternatives using a low-dimensional model. Let n be 
the vector representing values of all alternatives. Linear beliefs assume the truth /x can be 
represented as a linear combination of a set of parameters, that is, /x = Xct, where cx is the 
underlying coefficient and X represent the alternative matrix, that is, each row of X is a 
vector representing an alternative. 

The problem is that there are many applications that is high dimensional, that is, the 
coefficient o: can potentially have hundreds or even thousands of coponents. For example, in 
learning the accessibility profile of a large RNA molecule, the underlying weight coefficient 
describing the accessibility of each site is high dimensional due to the large size of RNA 
molecule. However, it is t ypically the case t hat only a small portion of these coefficients 
contain explanatory power Reves et al.l . l2014i | . 

More generally, for these applications, we propose a sparse additive model which offers 
considerably more flexibility than a linear model, while recognizing that the final model will 
be relatively low dimensional. Sparse additive model assumes the truth takes the form 

+ 12 (^ 12 ) + • • • + fp{Xip) + Q, for i = 1,..., M, 


where the fjs are one-dimensional smooth functions, q is some Gaussian noise and M is the 
number of competing alternatives. In high dimensional settings, we assume that most of the 
fjS are zeros. If each fj is a linear function, then the sparse additive belief reduces to linear 
belief. In this model, we are working on a model with a potentially large number of features, 
most of which do not contribute significant explanatory power. Our challenge is not only to 
design an efficient search algorithm for identifying the best alternative, but also identify the 
underlying sparsity structure. 

In this paper, we study high dimensional optimal learning with sparse beliefs. We first 
derive a knowledge gradient policy for linear models (KGSpLin) with ^ 1^00 group Lasso 
penalty. More generally, we can assume the belief function takes an additive model, which 
is a summation of unknown smooth functions of each feature, where only a few components 
are nonzero. If we approximate each smooth function by B-splines basis, the sparse additive 
model can be fitted using group Lasso. Therefore, KGSpLin can be naturally generalized to 
the knowledge gradient policy for sparse additive models (KGSpAM). Here we introduce a 
random indicator variable and maintain a Beta-Bernoulli conjugate prior to model our belief 
about which variables should be included in or dropped from the model. Additionally, in 
the broader class of models known as multivariate splines functional ANOVA model, tensor 
product B-splines can be adopted. KGSpAM can also be used in this model. 
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The remainder of the paper is organized as follows. Section [3] formulates the ranking 
and selection model in a Bayesian setting and establishes the notation used in this paper. 
It also highlights the knowledge gradient using both lookup table and a low dimensional 
linear, parametric belief model and introduces the homotopy algorithm for recursive ii^oo 
group Lasso. Section 0] is devoted to a detailed description of the KGSpLin policy for 
high dimensional linear models with .^ 1,00 group Lasso. Section O generalizes the algorithm 
to nonparametric sparse additive belief model (KGSpAM) and also SS-ANOVA. Theoretical 
results are presented in Section[6l which shows the estimation error bounds for both posterior 
mean and functional estimate. In Section [71 we test the algorithm in the context of a series 
of controlled experiments. 


2 Literature 


There has been a substantial literature on the general problem of finding the maximum 


of an unknown function w’ 


searching decisions. ISoall 


rere we rely on making noisy measurements to actively make 


20051 1 provides a thorough review of the literature that traces 


its roots to stochastic approximation methods. However, these methods require lots of 
measurements to find maxima precisely, which is unrealistic when measurements are very 
expensive. 

Our problem originates from the R&S problem, which has been considered by many au¬ 
thors, under four distinct mathematical for mulations. We specifically co nsider the Bayesian 
formulation, for which early work dates to 


Raiffa and Schlaifei 


ematical formulations are the indif-ference-zone formulation 


optimal computing budget allocation, or OCBA [Chenl . 


large-deviations approach [Glvnn and Juneiai. 


19^ |. The other m ath- 


2nifl . Tchen et al 


Bechhofer et al 


199,511 : the 


2012 ]: and the 


2004 


In the Bayesian formulation, this RfcS problem has received considerable attention un¬ 
der the umbrella of optimal learning Powell and Rvzhovl . l2012l |. In this work, there are 
three major classes of functi on approximation m e thods : look-up tables, parametric models 
and nonparametric models. iGuota and Miesckel 199fit | in troduces the idea of selecting an 


altermative based on the marginal value of information. [Frazier et al.l 2008l | extends the 
idea under the name knowledge gradient using a Bayesian approach which estimates the 
value of measuring an alternative by the predictive distributions of the means, where it was 
shown that the policy is myopically optimal by construction and asymptotically optimal. 
The knowledge gradient using a lookup table belief model approximates the function in a 
discrete way, without any u nderlying explicit s t ructu ral assumption, for both uncorrelated 
and correlated alternatives 
found in 


Chick et al. 


Frazier et al 


200a, I 2 OO 9 I I . Another closely related idea can be 


200 1[ | , where samples are allocated to maximize an approximation to 


the expected value of information. iNegoescu et al.l 201 1[ | introduces the use of a paramet¬ 
ric belief model, maki ng it possible to solve problems with thousands of alternatives. For 
nonparametric beliefs, iMes et al.l 201 ll | proposes a hierarchical aggregation technique using 


the common features shared by alternatives to learn about many alternatives from even a 
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single measurement, while iBarut and Powelll 2ni.‘ll | estimates the belief function using kernel 
regression and aggregation of kernels. However, all the methods above assume low dimen¬ 
sional belief models, where the number of features is relatively small. There are applications 
with hundreds or even thousands of features, but where only a few features are relevant. In 
such settings, previous algorithms may require a lot of tedious computation on the overall 
features. 

Additionally, outside of the Bayesian framework, there is another line of research on 
sparse online learning, in which an algorithm is faced with a collection of noisy options of 
unknown value, and has the opportunity to engage these options sequentially. In the online 
learning literature, an algorithm is measured according to the cumulative value of the options 
engaged, while in our problem we only need to select the best one at the end of experiments. 
Another difference is that, rather than value, researchers often consider the regret, which 
is the loss of our option compared with the optimal decision in hindsight. Cumulative 
value/regret is appropriate in dynamic settings such as maximizing the cumulative rewards 
(learning while doing), while terminal value/regret fits in settings such as finding the best 
route in a transportation network (learn then do). Moreover, most of the algorithms in 
online learning are based on stochastic gradient/subgradient descent met hod. The key idea 


2009 

Langford et al. 

2009. 

Xiao 

2010 

Lin et al. 

2011 

Chen et al. 

2012 . 

Ghadimi and Lan 


2ni2t |. However, a major problem with these methods is that while the intermediate solutions 


are sparse, the final solution may not be exactly sparse because it is usually obtained by 
taking the average of the intermediate solu tions. 

Additive models were first proposed by 


Friedman and Stuetzk 


198ll ] as a class of non¬ 


par am etric regression models and has received more attention over the decades 


Hastie and Tibshirani 


1990(]. In high dimensional statistics, there has been much work on estimation, prediction and 


model selection for penalized methods on addit i ve model Zhang et al 


200a, 


Ravikumar et al 


20091, 


Fan et al 


2011 


Guedi and Alauieii . 


201 


20041 


Lin and Zhand . 


Sparsity is a fea¬ 
ture present in a plethora of natural as well as manmade systems. In optimal learning 
problems, it is also natural to consider sparsity structure not only because nature itself is 
parsimonious but also because simple models and processing with minimal degrees of free¬ 
dom are attractive from an implementation perspective. Most of the previous work on sparse 
additive models study it in a batch setting, but here we study it in an active learning setting, 
where not only observations come in recursively, but also we get to actively choose which 
alternative to measure. 


3 Notation and Preliminaries 

In this section, we briefly review some results from Bayesian models for R&S and the 
recursive algorithm for d-i^oo group Lasso. We start with an introduction of notation: Let 

M = [Mij] G and v = [ui,_ ,Vd]'^ € We denote vj to be the subvector of v 

whose entries are indexed by a set I. We also denote j to be the submatrix of M whose 
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rows are indexed by I and columns are indexed by J. For I = J, we simply denote it by 
M/ or Mj. Let M/* and M*j be the submatrix of M with rows indexed by I, and the 
submatrix of M with columns indexed by J . Let supp(u) := {j : Vj / 0}. For 0 < p < oo, 
we define the io,^p,^oc vector norms as 

d 

Iloilo := card(supp(u)), ||u||p := and||u||oo := max |uj|. 

^ ^ l<2<d 

i=\ 

For a matrix M, we define the Frobenius norm as: HMHj? := the 

^p norm to be: ||M||p = max||„||p=i ||Mt;||p. For any square matrix M, let Amax(]V[) and 
Amin(M) be the largest and smallest eigenvalue of M. For a summary of most symbols we 
use, please refer to Table [2] in Appendix A. 

3.1 The Bayesian Model for Ranking and Selection 

We denote the unknown function p{x) : A” i—)• M, where X C M"* is a finite set with M 
alternatives. In addition, if we have a measurement budget of N, our goal is to sequentially 
decide which alternatives to measure so that when we exhaust our budget, we have maximized 
our ability to find the best alternative using our estimated belief model. Here we use x 
to denote the vector and x to denote the corresponding alternative index, that is, x € 
{!,..., M}. We also use px for p{x). Let /r = [pi,..., puY'■ Under this setting, the 
number of alternatives M can be extremely large relative to the measurement budget N. In 
a Bayesian setting, we assume pu takes multinormal distribution 

/x~W(0,5]). (1) 

Now suppose we have a sequence of measurement decisions, x^, ..., x^~^ to learn about 

these alternatives. Here £c* E A”, for z = 0,... , — 1. At time re, if we measure alternative 

X, we observe 

where ~ AA(0, cr^) and ax is known. 

Initially, assume we have a multivariate normal prior distribution on /x. 

Additionally, because decisions are made sequentially, a;"' is only allowed to depend on the 
outcomes of the sampling decisions x^,x^,, x"'~^. In the remainder of the paper, a random 
variable indexed by re means it is measurable with respect to which is defined as the 
cr-algebra generated by {{x^,y^o), (a:^, y^i),..., {x^~^, y^n-i)}. Following this definition, we 
denote 0” := and X)” := Var[^|J-'"]. It means conditionally on our posterior 

belief distribution on pL is multivariate normal with mean 0”' and covariance matrix H”. 
When the measurement budget of N is exhausted, our goal is to find the optimal alternative, 
so the final decision is 

N oN 

X = argmax 0^ . 

xex 
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We define II to be the set of all possible policies satisfying our sequential requirement; that 
is, n := G Let indicate the expectation with respect to the 

prior over both the noisy outcomes and the truth // while the sampling policy is fixed to 
TT G n. After exhausting the budget of N measurements, we select the alternative with the 
highest posterior mean. Our goal is to choose a measurement policy maximizing expected 
reward, which can be written as 


sup E’’ 

ttGII 


max. 6^ 
x£X 


We work in the Bayesian setting to sequentially update the estimates of the alternatives. 
At time n, suppose we select = x and observe we can co mpute the n + 1 tim e 

posterior distribution with the following Bayesian updating equations 

n+l _ nn 


Gelman et al. 


20031]: 


6,n+l = Qn^ 


Vx 


^n+l _ 


= E’* - 


r2 

X 

^n 


+ ^xx 


(2) 




(T“ + 


where is the standard basis vector with one indexed by x and zeros elsewhere. 


3.2 Knowledge Gradient for Linear Belief 

In this section, we briefly review the knowledge gradient for correlated normal beliefs 
KGCB), which is a fully sequential sampling policy for learning correlated alternatives 


Frazier et al 


20081 ] . Here correlated alternatives mean that the performances of different 


alternatives may have correlations as described in ([T]). We also review the knowledge gradient 
for a linear belief model (KGLin). It means that the belief model is linear in terms of a set 
of know n basis functions. In this case, Bayesian updating is performed using recursive least 


squares [Frazier et al.l . 120091 ]. We represent the state of knowledge at time n as: 5” := 
(0”, E)”). The corresponding value of being in state S'” at time n is 

F”(S”) = max0L. 

The knowledge gradient policy is to choose the alternative that can maximize the expected 
incremental value. 


V, 


KG,n _ 


= E(F”+1(S”+Hx)) - F”(S”)1S”, x^ = x) 
= E(max V, X^ = x)- max 0”, 


and 


x'ex 


= argmaxuf^’”. 


x'ex 


x&X 


Frazier et al 


20091 ] proposes an algorithm to compute the KG values for alternatives with 
correlated beliefs. First we can further rearrange equation ([2]) as the time n conditional 
distribution of 0”"*“^, namely. 


0n+i ^ + 


(3) 
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where 




Zn+i 


V + ^xx 

(?/r ^ - ^g) 

^Var[yr' - 


(4) 


It is easy to see that is standard normal when conditioned on 
Then we substitute equation dS]) into the KG formula, 


Frazier et al. 


2008 


V 


KG,n 

X 


E(max0”/ + <Ta,/(5]” ®” = x) — max0”/ 

x'ex ^ x'ex ^ 

/r(0”,5(5]”,®)), 


where cr(5]”,®) is a vector-valued function defined in m and cr3;/(5]”, ®”) indicates the 
component ej,5(5]”,®”) of the vector 5(5]”,®”) and h{a,b) = E[maxjai -|- 6jZ] — max* a* 
is a generic function of any vectors of the same dimension, Z is a standard normal random 
variable. 

The expectation can be computed as the point-wise maximum of affine functions Ui + biZ 
with an algorithm of complexity 0{M‘^log{M)). It works as follows. First the algorithm 
sorts the sequence of pairs (a*, bi) such that the 6jS are in nondecreasing order and ties in b 
are broken by removing the pair (a*, bi) when bi = and a* < aj+i. Next, all pairs (a*, bi) 
that are dominated by the other pairs, that is, a* -|- biZ < max^yjOj -|- bjZ for all values of 
Z, are removed. Thus the knowledge gradient can be computed using 


KG 


= h{a,b)= ^ {bi+i-bi)f 




^2+1 


bi+i - bi 


where f{z) = 4>{z) -|- z^{z). Here 4>{z) and <h(z) are the normal density and cumulative 
distribution functions respectively, a and b are the new vectors after sorting a and b and 
dropping off the redundant components and are of dimension M. 

If the number of alternatives is quite large, the above representation becomes clumsy. 
Thus if the underlying belief model has some structure, then we could take advantage of this 
structure to represent the model and simplify the computation. In a simple case, if / has a 
linear form or can be written as a basis expansion, we can make it easier by maintaining a 
belief on the coefficients instead of the alternatives. 


Negoescu et al.l 20 111 ) further extends KGCB to parametric beliefs using a linear model. 


Now we assume the truth /x can be represented as a linear combination of a set of parameters, 
that is, /X = Xci, where /x G and a = [ai,... ,am]'^ £ R”^ are random variables, 
X G R^^™ represent the alternative matrix, that is, each row of X is a vector representing 
an alternative. If we assume a ~ AA('5, 5]’^), this induces a normal distribution on /x via 
linear transformation, 


/x~AA(X5,X5]’’X^). 
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At time n, if we measure alt ernative = x, w e can update 19 "'+^ and recursively 

via Recursive Least Squares [Powell and Rvzhovl . |2012[ | , 


^n+l = 


■ p ^+1 




^i9,n+l _ ^ 


where and 7 ” = cr^ + 

The linear model allows us to represent the alternatives in a compact format since the 
dimension of the parameters is usually much smaller than the number of the alternatives. 
Suppose we have tens of thousands of alternatives, without the linear model, we would need 
to create and update the covariance matrix 51” with tens of thousands of rows and columns. 
With the linear model, we only need to maintain the parameter covariance matrix the 

size of which is equal to the dimension of the parameter vector i?. In addition, we never need 
to compute the full matrix We only have to compute a row of this matrix. 


3.3 A Homotopy Algorithm for Recursive £1 00 Group Lasso 

In the Bayesian updating scheme described in Section H] and O a recursive £ 1^00 group 
Lasso is required, which we review in this section. When the regularization takes the ii 
norm, this regulari zed version with least squares loss is Lasso (least absolute shrinkage and 


selection operator) Tibshiranil . Il99fi[ | . It is well known that Lasso leads to solutions that are 
sparse and therefore achieves model selection. If we consider a more general group sparsity 
system, which is composed of a few nonoverlapping clusters of nonzero coefficients, group 
Lasso penalty can be used to encourage correlations within groups and achieve sparsity at a 
group level. Here we briefly descri be the recursive homotopy algorithm for ii^oo group Lasso 
propose d in 
refer to 


Chen and Hero 


20121. For t he recursive homotopy algorithm for Lasso, one can 


Garrigues and El Ghaouil [2008[ |. This algorithm computes an exact update of the 


n 

/3" = argmin ^ ^ [(, 
i=l 


X 


i-l\T 


/3-yf + A-||/3||i,c 


(5) 


optimal ii^oo penalized recursive least squares predictor. Each update minimizes a convex 
but nondifferentiable function optimization problem. This algorithm has been demonstrated 
to have lower implementation complexity than direct group Lasso solvers. It also fits the 
recursive setting in optimal learning. 

The li^oo group Lasso estimator after n observations is given by 

1 

in 

where G M x W^,i = l,...,n are the n observations. A” is the regularization 

parameter, and ||/3||i,oo := Ylj=i ll/^Sjlloo- {Gj}^=i is the group partition of the index set 
Q = {1,..., m}, that is, 

ij%,gj = g, e,ne' = 0 if 

and Pg. is a subvector of /3 indexed by gj. Let dj = \gj\ be the number of features in the 
jth group, and m = Yl^=i dj- Group Lasso reduces to Lasso when each group contains only 
one coefficient. 


















At time n, suppose we have /S” to the Lasso with n observation and we are given the 
next observation a;”) G M x M”*. The algorithm computes the next estimate via 

the following optimization problem. Let 
us dehne a function 

u{t, A) = argmin + tx^y'^^^) + A||/3||goo- 

^eK’" ^ 

We have /3” = u(0,A"') and (3'^'^^ = u(l,A”"'“^). The homotopy algorithm that computes a 
path from /3"' to in two steps: 


1 Fix t = 0, vary the regularization parameter from A” to A”'"’“^ with t = 0. This amounts 
to computing the regularizati on path between A"' and A"'”*'^ using homotopy methods 


as the iCap algorithm done in 


Zhao et al 


20091]. This solution path is piecewise linear. 


2 Fix A and calculate the solution path between u(0, A"’"''^) and u(l,A”+^) using the 
homotopy approach. This is derived by proving that the solution path is piecewise 
smooth in t. The algorithm computes the next “transition point” at which active 
groups and solution signs change, and updates the solution until t reaches 1 . 


4 Knowledge Gradient for Linear Model with oo Group Lasso 


In Section 13.21 we review knowledge gradient policy for linear belief models in low di¬ 
mensional settings. In this section, we derive the KG policy in a high dimensional linear 
model. We have /x = Xq, where X € is the alternative matrix and a G M™, /x G 

are random variables. Here m can become relatively large and a is sparse in the sense that 
only a few components of a contribute to /x. However, unlike the sparsity assumption in 
classical frequentist statistics, we assume the sparsity structure is random; that is, the group 
indicator variable of which is selected or not is a random vector. Specifically, we now assume 
there exists some known group structure in a, let (^ = [Ci,..., Cp] £ be a group indicator 
random variable of a. 


Cj 


1 if OLg^ 7 ^ 0 

0 if ag^ = 0 


for j = 1 ,... ,p. 


Additionally, a is assumed to be sparse in the following sense, 


a|C~AA(r9,X’’). 


( 6 ) 


Let 5 = {j : Cj = !}• Thus, without loss of generality, conditioning on (^, we can permute 
the elements of a to create the following partition, 

o?' = [(a5)'^,0], 

where a^ ~ )• So r? and can be correspondingly partitioned 


'd = 

'^s 

, = 


0 


0 


0 

0 
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Here we make a critical assumption on the distribution of ol. Let us assume that con¬ 
ditioning on ^ = 1, a has the following distribution; q:|(^ = 1 ~ Then for any 

other the conditional distribution of a on is normal with mean Ogi and variance 51^,. 
Here S' = {j : Cj = !}• This means that we can write all the conditional distributions of 
OL through an index set S characterized by (^. So in the following we use both and S as 
indices. We also omit the time dependent variable n to simplify notations. Furthermore, as 
we are updating the mean and covariance matrix of a certain conditional distribution, we 
also update all the elements with the same index in the other distributions. That means, 
through all the updatings, we just need to maintain the mean and covariance matrix on 

C = 1- 


4.1 Knowledge Gradient Policy for Sparse Linear Model 

Before deriving the sparse knowledge gradient algorithm, let us describe the Bayesian 
model at time n. To get a Bayesian update, we can maintain Beta-Bernoulli conjugate 
priors on each component of (^. At time n, we have the following Bayesian model, for 
J,/ = 1, • • • 

= 1 (7) 

Q\p^ ~ Bernoulh(p-), (8) 

for j//, (9) 

p-|^;,r/;~Beta(^”,r?-). (10) 

At time n, the prior is a discrete random variable. Let ..., be all the pos¬ 
sible realizations of C”, and P(<^” = = 1,..., So by the Law of Total 

Expectation, the KG value can be computed by 


V 


KG,n 

X 


where 


E(yn+l(^n+l(^)) _ ^ 


E(max , x" = x) — max 0L 

x'&x a; I ’ 

Epn E^n |pn E^ ^ g I ( rnax oq ^15"',®’^ 


x,C,p" 


max 0”/ 

x'&X 


^Epn(p^’^)/i(a”’^b^’^) 

k=l 


Nt 


E n 




k=l 


LX"’ =1} 


q + v 


n 


Vi 


^ Lx;’"=o} 


e + r/’ 




h{a, b) 

:= E[maxaj + biZ] — max a*, 

i i 

quM 

— -19^ 

^ ^n,k 5 

fjn,k 
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Note that conditioning on each sample realization of the KG calculation is identical with 
KGLin. The KG value for a sparse linear model is a weighted summation over all the possible 
sample realization of of which the weight Epn(p"^’^) is computed by the independent Beta 
distributions on all the p”’s. Also, if N(^ takes its largest possible value, that is N(^ = 2^, we 
can re-sort the weights and approximate the knowledge gradient value by only computing 
ones with several top largest probabilities. 


4.2 Bayesian Update 


At time n we have the Bayesian model described in (j7|) - (llOD . Parallel with that, we also 
have the current Lasso estimate, denoted as i?”'. The nonzero part is The covariance 
matrix corresponding to is denoted as which is Monte Carlo simulated from the 

first order optimality condition of the optimization problem Q (details described in Section 
asi). After we get the new observation, we can update to the next Lasso estimate recursively 
by the algorithm described in Section 13.31 Thus we have the updated Lasso estimate 

Let := {j : dg, / 0}. The Bayesian updating equations are given by 


and S 




Gelman et al 


20031]: 




(s 






1 -1 


^n+1 




5 + 


( 11 ) 

( 12 ) 


+ 1 ’ 


for 

for 




Now we briefly recall and summarize the random variables which play a role in the 
measurement process. The underlying and unknown value of alternative x is denoted fix and 
parametrized by a. Here a follows a “mixture” normal distribution by Q and Q follows a 
conditional Bernoulli distribution with the frequency of “in” and “out” denoted by and 
rjj. Both a and C are randomly fixed at the beginning of the measurement process. At time 
n, C” and i?” give us the best estimate of a. is the precision with which we make 

this estimate. The result of our time n measurement causes us to first update the Lasso 
solution from i?” to 19"+^ and then update the mean estimate from •dg to which we 

now know with precision (51^’"'"*’^)“^. 

One may think of C and a as fixed and of C" as converging toward and as converging 
toward a while some norm of the precision matrix (5]^’”)“^ converging to infinity under some 
appropriate sampling strategy. It is also appropriate, however, to fix and 'dg and think 
of and a as unknown quantities. Furthermore, from this perspective, the randomness of C 
and a does not imply they must be chosen from Bernoulli and mixture normal distribution 
respectively, but instead it only quantifies our uncertain knowledge of C and a adopted when 
they were first chosen. 
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4.3 Knowledge Gradient with Recursive £i oo Group Lasso 

In this section, we first provide a technique to approximately sample the covariance 
matrix from the first order optimality condition in problem ([5]). Then we outline the 

knowledge gradient policy for sparse linear models in Algorithm [H 

We begin with a series of set definitions. Figure [T] provides an illustrative example. Let 
us divide the entire group index into V and Q respectively, where V contains active groups 
and Q is the complement. For each active group j G P, we partition the group into two 
parts: Aj with maximum absolute values and Bj with the rest of the values. That is 

Aj = argmax |/3fc|, Bj = Qj — Aj, j G V. 

k^Gj 


ABC 


Gi 




Figure 1: Illustration of the partitioning of a 20 element coefficient vector /3 into five groups 
of four indices. The sets V and Q contains the active groups and the inactive groups, 
respectively. Within each of the two active groups the coefficients with maximal absolute 
values are denoted by the black color. 


The set A and B are defined as the union of the Aj and Bj sets, respectively. 


A = yjj^'pAj, B = Uj^'pBj. 


Finally, we define 


c = Uj^QQj, Cj = Qj n c. 

The £ 1^00 group Lasso problem ([5]) can also be written as 

/J'" = argmin - /3'^r” + A”||/3||i,oo- (13) 

This optimization problem is convex and nonsmooth since the £ 1,00 norm is nondifferentiable. 
Here there is a global minimum at if and only if the sub differential of the objective function 
at /3 contains the 0-vector. The optimality conditions for (I13p are given by 

TC-^l3-r^ + \^z = 0, G 5||/3||i,oo. (14) 
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We also have that z G 5||/3||i,oo if and only if z satisfies the following conditions, 


II^T.IIl 

= 1, j e 'P, 

(15) 

sgn(z^.) 

= sgn(/3^.), jeV, 

(16) 

Zb 

= 0 , 

(17) 


< 1, j G Q, 



where A,B,C,V and Q are /3-dependent sets defined above. For notational convenience we 
leave out the time variable n in the set notation. As (3c = 0, (|14l) implies that 

-r^ + X^zs = 0 , 

-r^ + X^zc = 0. 

If ' is invertible, then the solution is unique and we can rewrite ()18h as 

/35 = (Rr')-'(r-S-A”z5). 

Let G ^nxm design matrix at time n defined as 

(X-Y := 

and 

Y- := 

Then (frUjl is equivalent to 

Ps = - ^”^ 5 ]. ( 20 ) 

Let ^. Since the elements of Y"" are independent and Cov(Y"') = 

cT^i, dsni) gives us 

Cov(/35)(”) = M”-V2 + (A”)2m^-1Cov(z5)(")M”-i. (21) 

By definition, := Cov{Ps)^^'^■ If we replace n with n + 1 , m provides us with the 
equation 

^ ^ (A’*+i)2MgCov(z5)('^+^)Mg. (22) 

One should note that we can not directly compute from the right hand side of (1221) . 

since zg is also a random variable dependent on But assuming that should not be 

far from one can sample a set of random variables from the distribution AA('i3^, X^’”') and 
then sample the subgradients according to the equations (fT5]) , (fTHj) and (fT7)) , so Cov(z5)("+^) 
can be estimated from the sample covariance matrix estimator Cov(z 5 )("^"*'^). Additionally, 
to make this estimator stable in theory, we need to make sure that all the eigenvalues of 


(18) 


(19) 
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are bounded away from 0 and infinity. Heuristically, we first define a matrix 
space A^(Cmin; C^max) 

-^(C^min; C^max ) = {M : Crain < Amin(M) < Amax(M) ^ C*max}' 

Then we can project Cov(z 5 )^"'''^^ into A4(C'min) C'max) and find a solution Cov( 2 ; 5 )(”"''^) to 
the following convex optimization problem 

c 5 v( 25 )(^+i) = argmin ||C^(:Z 5 )("+^) - M||ir. (23) 

]VI G ( C*inin) C'max ) 

Empirically we can use a surrogate projection procedure that computes a singular value de- 
composition of Cov(jZ 5 )("+^) and truncates all the eigenvalues to be within interval [Cmin, C'max]- 
Therefore we can approximately estimate by 

g^,n+l ^ ^ (A^+l)2MgCOT(;Z5)'^+^Mg. (24) 

Now we have all the ingredients for the knowledge gradient policy for sparse linear model 
(KGSpLin) and we outline it in Algorithm [TJ 


Algorithm 1 Sparse Knowledge Gradient Algorithm 


Input: 19°, X, 

Output: 

for n = 0 ; A — 1 do 


1 T) ICG.n 

1. KG: X = argmaxux 


2. Lasso homotopy update]^ i?”, {x^, G K™' x M, A"", —)■ 

3. Monte Carlo Simulation; approximately simulate from 

4. Bayesian update to: 


end 


5 Knowledge Gradient for Sparse Additive Model 

As we have the sparse knowledge gradient algorithm for ii^oo group Lasso, we can gener¬ 
alize the knowlege gradient for sparse linear model to a nonparametric sparse additive model. 
In this section, we first describe the knowledge gradient for a sparse additive model, then we 
generalize it to the multivariate functional ANOVA model through tensor product splines. 

5.1 Sparse Additive Modeling 

In the additive model, p = [yi,... iHmY' G 1^^, X = [Xij] G is the alternative 

matrix and 

_ p _ 

yi = f(Xi^) = Q+ '^fj{Xij), for z = 1,... ,M, (25) 

i=i 

^In practice, we often begin with some historical observations. Thus in the first iteration the Lasso 
estimator can be obtained from the historical dataset. 
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where the fjS are one-dimensional smooth component functions, one for each covariate and 
^ = [?i,..., is the residual term. For simplicity and identification purposes, we assume 
^ = 0 and f fj{xj) dxj = 0 for each j. When fj{x) = ajX, this simply reduces to the linear 
model in Section 01 In a high dimensional setting, where p may be relatively large, we assume 
most of the fjS are zeros. 

If the truth takes the nonparametric additive form as in (|25p . similarly, we let the choice 
of which fj is selected or not be random. Let C = [Cii • • • > CpV' ^ be the indicator random 
variable of /j’s, that is. 


0 


1 if f, + 0 

0 if /, = 0 ’ 


for j = l,...,p. 


Firstly, let us approximate each functional component in (I25|) through one-dimensional 
splines. Without loss of generality, suppose that all elements of X take values in [0,1]. Let 
d = tq < Ti < ■ ■ ■ < tk < tk +1 = 1 be a partition of [0,1] into K + 1 subintervals. Let Si be 
the space of polynomial splines of order I (or degree I — 1) consisting of functions h satisfying 


1 the restriction of h to each subinterval is a polynomial of degree / — 1; 

2 for I > 2 and 0 < I' < I — 2, h is V times continuously differentiable on [0,1]. 


4.1 in 


Schumaker 


1981 


This definition is phrase d after IStond 1985l | , which is a descriptive version of Definition 


p. 108]. Under suitable smoothness assumptions, the fj's can 
be well approximated by functions in 5;^.. Specifically, let fj E Si. be the estimate of fj. 
Furthermore, for each fj, there exi s ts a n ormalized B-spline basis {4>jk{x), 1 < k < dj} for 


Si ., where dj = K + Ij [Schumakerl . Il98ll | . If we let otj, = [aji ,..., ajdj] be the coefficients 


of fj projected onto Si., then for any L E 5/., we can write 


fji^) = oijkfjk{x), for 1 < j < p. 


(26) 


k=l 


Algorithm 2 Knowledge Gradient Algorithm for Sparse Additive Models 


Input X 


i9,0 


,{{».X. 


dj,p 


\K+1 


Output: {/f 


'j >] 

for n = 0 : A — 1 do 

1. KG: a;” = argmaxu,^^’""; 

2. Lasso homotopy update: i?”, {4>jk{x'j ),y^~^^) E M™' x M, A", 

3. Monte Carlo Simulation: approximately simulate from 

77 

’ ’h 


4. Bayesian update to: 


end 




^The prior mean and covariance matrix can also be obtained by some priors on ffs. 
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Let Q = [qi,, ... ,Q:p»]- We assume that a takes the conditional distribution 

and also has the sparsity structure as described in Section [H Then at time re, we also have 
estimate /” from group Lasso based on one-dimensional splines. More Specifically, for each 
/” G Sij, let 1 ?”, = ..., be the coefficients of /” and let i?"' = [i?”,,..., i?”,]. 


Accordingly, in the batch setting, where we already have re samples (jc* G M''- x 


1,..., re, one can get by solving the following penalized least squares problem 

2 


1 


n 


I?” = argmin — 


E 

i=l 


y'-tt '& jk4*jkix j ) 


P 


i=i 


OO) 


= 


(27) 


j=i k=i 

where A is the tuning parameter. Optimization problem (j27p is essentially an ii^oa group 
Lasso optimization problem. The parameter p is the number of groups and the group sparse 
solution on would lead to a sparse solution on /j’s. Accordingly, we can also derive the 
knowledge gradient policy and Bayesian updating formulas as in Section HI Here we let /” 
be the Bayesian estimate of fj at time re, that is, 


^'jk(l>jk{x), for l<j<p. 


k=l 


We outline the knowledge gradient algorithm for sparse additive models (KGSpAM) in Al¬ 
gorithm [2j 


5.2 Tensor Product Smoothing Splines Functional ANOVA 


If the regression functions in (|25p can also take bivariate or even multivariate functions 
this model is known as the smoothin g spline analysis of variance (SS-ANOVA) model 


1990, 


Wahba et al 


Wahba 


1995 . IGuI. 12002 ]. In SS-ANOVA, we write 


T ^ ^ ^ ^ fjki^iji ^ik) T 


(28) 


j=l j<k 

where /^’s are the main effects components, fjkS are the two-factor interaction components, 
and so on. <j is the residual term. Similar as before, we assume <; = 0, f fj{xj)dxj = 0 for 
each j, JJ fjkixj,Xk) dxjdxk = 0 for each j, k and so on. This model is also called functional 
ANOVA. The sequence is usually truncated somewhere to enhance interpretability. This 
SS-ANOVA generalizes the popular additive model in Section 15.11 and provides a general 
framework for nonparametric multivariate function estimation, thus has been widely studied 
in the past decades. 

As we approximate each fj by Si^, under certain smoothness assumptions, fjk can be 
well approximated by the tensor product space defined by 


Si - ( 8 ) 54 : = {hjhk : for all hj G Si^,hk G 5 ;^} 

= ^ Crg(pjr4>kq ■ for all Crg G M}. 

r=l q=l 
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Let 


(pjrkq{Xj,Xk) := (l)jr{Xj)(j)kq{Xk), ioT 1 < r < dj, 1 < q < dk, 

then these are the basis functions for djdk dimensional tensor product space Si^ This 

can also be generalized to multi-factor interaction components. Therefore, similarly, we can 
write all the functional components in ()28p as basis expansion forms. Then we can generalize 
a knowledge gradient algorithm for SS-ANOVA model. 

6 Theoretical Results 

In this section we provide the estimation error bounds of the Bayesian posterior mean 
estimate in Algorithm [T] as well as of the functional estimate in Algorithm [2J We first state 
the selection and estimation properties of ii^oa group Lasso in high dimensional settings 
when the number of groups exceeds the sample size. We show the estimation error bound of 
group Lasso. We also provide the sufficient conditions under which the group Lasso selects 
a model whose dimension is comparable with the underlying model with high probability. 
Based on these results, we assume that we begin with some historical observations and the 
Lasso estimator from the historical dataset has good initial property. If we have such a 
“warm” start, we can show that the Bayesian posterior estimation error is bounded as in 
Theorem [TJ The theorem actually shows that the posterior can converge to the truth at the 
same rate as that of group Lasso. Besides, based on this error bound, we can also show the 
estimation error bound of the functional estimate as in Theorem [21 Note that these error 
bounds are proved on the intersection S of the support set 5” from group Lasso estimator. 
But we can also prove that S is comparable with the true support set S*. Additionally, all 
these theorems show the estimation error bounds as large enough measurements are made. 
Since our policy is also myopically optimal by construction, this lends a strong theoretical 
guarantee that the algorithm will work well for finite budgets. 

6.1 Bayesian Posterior Mean Estimation Error Bonnd 

In addition to the aforementioned notation, let e” = [e^,... ,6”]^ be the measurement 
noise vector, so we have Y"’ := -|- e”. Then, we define the maximum group size 

d := maxj=i^...^p dj and the minimum group size d := minj=i^...^p dj. Let d = d/d. Let 
‘5” = {j ■ Tdg. 7 ^ 0} be the estimated group support from current Lasso estimator. Let S* 
be the true support. Also, let s* = |5*| be the cardinality of S*. 

Before proving the estimation error bound, let us first introduce the selection and estima¬ 
tion properties of ii^oo group Lasso. Our presentation will need the following assumptions. 

Assumption 1. For any n, the random noise errors e^, ... ,e” are independent and identi¬ 
cally distributed as 
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Assumption 2. The design matrix X"" ^ satisfies the sparse Riesz condition (SRC) with 
rank r and spectrum bounds 0 < c* < c* < oo i/ 


11 1 '' 11 2 ^ 




MS with r = 151 and u G 


j€S 


n 


We refer to this condition as SRC (r, c*,c*). 

Assumption 3. For a given group Q = {Gi,... ,Gp}- We say is block normalized if 

’■n— 

^*5j 


X n —111 

l|2 


n 


< 1, for all j = 1,2, 


,P- 


Remark 1. In Assumption\^ we set the upper bound to one in order to simplify notation. 
This particular choice entails no loss of generality. Note that this assumption is a natural 
generalization of the column normalization condition. Specifically, if we have m = p groups, 
each of size one, the matrix norm reduces to the vector norm on every column oflK^~^. 

All three assumptions can be reasonably expected to hold in practice. Assumption [1] is 
on the distribution of random noise. The SRC in Assumption [2] assumes the eigenvalues of 
the sample covariance matrix ^ ;= are bounded below from zero and 

above from infinity when the size of S is no greater th an r. It is natural to ask whether 
such condition also holds for general matrices. In fact, IZhang and Huana 20081 1 provides 
sufficient conditions for the sparse Riesz condition for both deterministic and random design 
matrices X. As we consider the designs are deterministic in thi s work, we only present the 
sufficient condition for deterministic design matrices proved bv IZhang and Huand 2008t | in 
the following proposition. 


Proposition 1. Suppose X 


n—1 


(XT ^fn be the correlation. If 


is column standardized with ||XT = 1. Let pjk = 




max inf 
|5|=rK>l 


E E 


<5 < 1 , 


then the sparse Riesz condition in Assumption [3 holds with rank r and spectrum bounds 
c* = 1 — (5 and c* = 1 + <5. In particular, Assumptionl^ holds with c* = 1 — <5 and c* = 1 + <5 

if 

5 


max \pjk\ < 


r — 1 ’ 


d < 1. 


Based on these as sumptions, we can combine the results in IWei and Huand 2nifll | and 


Neeahban et al.l j2012l | and get the estimation error bound for £ 1^00 group Lasso estimator as 


given in Lemma [T] 

Lemma 1. Under AssumptionUl andO if we solve the group Lasso given in 


with 


A"' = 0{dy^n logp), 

then the following properties hold with probability converging to 1: 
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(1) |5"'| < for some finite positive constant Ci. In specific, Ci = 2 + 4dc, where 

c := c* jc^. 

(2) Any optimal solution /3"’ to ([5]) satisfies the following error bound 

for some positive constant C 2 - 

As one can see from the updating equations in ()13p and (jl 2 p . the posterior mean estimate 
is the weighted sum of prior and the current Lasso estimate If the Lasso 

estimate has £2 estimation bound as described in Lemma [H the posterior estimate should 
also have a similar bound under certain conditions of the weighted covariance matrix. One 
should note that both the mean and covariance are updated on some support S from the 
current Lasso estimate. Thus we will work on a sequence of Lasso solutions and prove the 
bound on the intersection support set as large enough samples are made. Also note that in 
order to use the bound in Lemma [U we need to make sure that assumptions [H [2] and [3] are 
satisfied for every Lasso problem in such a sequence. Assumptions [T] and [3] are easy to satisfy. 
To show all the sequential Lasso problems satisfy Assumption [2l we work from a “warm” 
start at time N'. The following proposition actually verifies that if the design matrix at time 
N' satisfies Assumption [2l then the following ones should also satisfy this assumption, only 
with a slight loose on the constant. 

Proposition 2. If for any n, there exists some constant B > 0 such that ||®’^||2 < B- 
Besides, assume for some large enough N', the design matrix satisfies condition SRC 

(r, c*,c*). Then, for all N' < n' < cN', of which c> 1 is some constant, the design matrix 
Xn'-i 

can satisfy condition SRC (r, c*/c, max(c*, i?)). 

Thus we have all the ingredients to complete the proof of £2 error bound of the Bayesian 
posterior mean estimator. Before that, let us state some assumptions for this result. 

Assumption 4. For any n, there exists some constant B > 0 such that ||®"^||2 < B. 

Assumption 5. For some large enough n, suppose for some constant c > 1 and n < cN', 
the design matrix satisfies the block normalization condition 0 and condition SRC 

(C 3 S*, c*, c*), where C 3 := 2 + 4dcmax(c*, B)/c^,. 

Under these assumptions, we have the following theorem of the £2 mean posterior esti¬ 
mation error bound. 

Theorem 1. Under Assumption [30 and\^ if we solve the group Lasso given in (l5|) with 

A"" = 0 {dy/n logp) 

and let s-= n”.=A»5"' , then the following properties hold with probability converging to 1 : 
(1) l^l < (^315*1 for some finite positive constant C 3 defined in Assumption\^ 
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( 2 ) Any posterior estimate i?"' from Algorithm[I\ satisfies 

ll^on . ||2 / C4a2s*(Flogp 
11^5 “ ^5ll2 S - , 

for some positive constant C 4 . 

6.2 Functional Estimation Error Bonnd 

Based on the results in Section ISTTl we can also get the error bound for functional estimate 
of Algorithm [2] in Section [5.11 To show this error bound, let us introduce more definitions 
and assumptions. 

Let /? be a nonnegative integer, let 5 E [0,1] be such that q = /3 +6 > 0.5, and L E (0, 00 ). 
Let 7i{q,L) denote the collection of functions h on [0,1] whose /3th derivative, , exists 
and satisfies the Holder condition with exponent 5, 

\h^^\t') — h^^\t)\ < L\t' — for 0 < t, < 1. 

Whenever the integral exists, for a function h on [0,1], denote its || • II 2 norm by 

\\h \\2 := K^{x)dx, 

Additionally, for any S C {1,... ,p}, we define 

Il^<sll 2 := \\hj\\l- 

To prove the functional estimation error bound, we assume the true functions belong to this 
function class with smoothness parameter q = 2 . 

Assumption 6. fj E 'H{2, L) for 1 < j < p. 

Also note here we have the new design matrix on the basis fijk- Let be 

the n X dj matrix '^j{i,k) = where ifjk is the orthonormal B-spline basis. Let 

:= ..., Based on this and Theorem [H we have the following theorem of 

the functional estimation error bound. 

Theorem 2. Under assumptions\^ and\^ if the design matrix satisfies Assumption 

[3| and the group Lasso is solved with some A" satisfying 

A"' = 0{d^n logp), 

let S d = s* = 0 ( 1 ), then the following properties hold with proba¬ 

bility converging to 1 : 

(1) |5| < Osj^*] for some finite positive constant C 3 . 
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(2) Any posterior estimate /”■ from Algorithmic satisfies 




logp 


n 


2/3 


where C5 is some positive constant. 


Remark 2. Note here we use group Lasso instead of £1^2 group Lasso, this is because 
the homotopy algorithm for recursive ii^oo group Lasso largely reduces the computational 
complexity, but we do not have such results for £1^2 group Lasso. However for £1^2 group 
Lasso, the bound takes the form ||/3"' — j[g ggg^ ^/jg g^^gg term for 

£ 1,00 group Lasso ^ d, logp larger by a factor of d, which corresponds to the amount by 
which an £ca-ball in d dimensions is larger than the corresponding £2-ball. Therefore, we 
do not achieve the minimax optimal rate as in £1^2 group Lasso. Thus using £i^cx} group 
Lasso instead of £1^2 group Lasso is actually a tradeoff between computational complexity and 
statistical estimation. 


7 Experimental Testing 

In this section, we investigate the performance of KGSpLin and KGSpAM in controlled 
experiments. In these experiments, we repeatedly sample the truth from some distribution 
and compare different policies to see how well we are discovering the truth. 

We first test the KGSpLin by generating a linear model with p = 100 predictors, in ten 
groups of ten. The last 80 predictors all have coefficients of zero. The coefficients of the first 
2 groups, that is 20 predictors, are randomly sampled from a normal distribution with means 
from 11 to 30 respectively, with standard deviation of 30% of the mean. We randomly choose 
M = 100 alternatives from some Gaussian distribution. Finally, normal measurement noise 
with standard deviation e is added to each observation. In our first experiment, we focus on 
the comparison with KGLin and exploration policies using a relatively large measurement 
budget N = 200. 

Furthermore, of all the experiments in this paper, to make a fair comparison of KG 
and exploration, the updating scheme when using the exploration policy is as described in 
Section The only difference is that at each iteration, exploration randomly measures 
each alternative with the same probability, while KG chooses the one with maximum KG 
value. Also, we assume that we do not have any prior information on the sparsity structures. 
That is, fj =11 j = 1) for j = 1) • • • jP- 

Figure Ela) shows the corresponding misclassification groups for KGSpLin and KGLin 
as the regularization parameter A is varied. (A misclassified group is one with at least one 
nonzero coefficient whose estimated coefficients are all set to zero, or vice versa.) Figure 
[2jb) and (c) show the log of the averaged opportunity cost over 300 replications using a well 
chosen tuning parameter with low and high measurement noise (the standard deviations of 
the measurement noises are respectively 5% and 30% of the expected range of the truth). 
Here the opportunity cost (OC) is defined as the difference in true value between the best 
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option and the option chosen by the policy, that is 


OC = max^j — 

i 



Figure 2: (a) shows the misclassified groups for KGSpLin and KGLin as the regularization 
parameter A is varied. (b)(c) shows the averaged opportunity cost over 300 runs under low 
measurement noise (5% range of the truth) and high measurement noise (30 % range of the 
truth). 


From Figure [2] we can see that during the first several iterations, KGSpLin behaves 
comparable with pure exploration, because Lasso takes several iterations to identify the key 
features. However, after several initial samples, KGSpLin far outperforms both KGLin and 
pure exploration. This is because Lasso gives a rather precise estimate of the sparse linear 
coefficients given enough samples. So the algorithm mainly updates the beliefs on the key 
features based on these Lasso estimators, leading to more pr ecise estimates of the m odel. 

To further compare the KGSpLin policy with KGLin from lNegoescu et al.l 201 ll | for high 
dimensional sparse belief functions, we take several standard low dimensional test functions 
and hide them in a p = 200 dimensional space. These functions were designed to be min¬ 
imized, so both policies were applied to the negative of the functions. Table [T] shows the 
performance on the different functions. Each policy was run 500 times with the specified 
amount of observation noise. Table [T] gives the sample mean and standard deviation of the 
mean of the opportunity cost after = 50 iterations of each policy. Here each function is 
scaled to have a range of 100, so that the measurement noises are given on the same scale. 

Furthermore, we now test the KGSpAM policy on the following SS-ANOVA model with 
p = 100 and four relevant variables. 


P'i — fl2{Xi\,Xi2) + fj{Xij) -|- €i, Cj ~ AA(0,1); 
1=3 
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Test function 

cr 

E(OC) 

KGSpLin 

cr(OC) 

Med 

E(OC) 

KGLin 

cr(OC) 

Med 

Matyas 

1 

.0104 

.0256 

.0071 

.0284 

.0157 

.0244 

[-10,10]^ 

10 

.2772 

.1960 

.0125 

.3451 

.1166 

0.3781 


20 

.7658 

.8423 

.3997 

1.7155 

.3208 

1.5627 

Trid 

1 

2.1422 

1.4011 

1.1843 

2.7092 

1.5331 

1.3036 

d = 6,X = [-36,36]® 

10 

9.8196 

3.8757 

8.9874 

9.9787 

4.2098 

8.2282 


20 

15.7164 

4.0201 

14.9040 

16.8911 

4.5881 

15.4959 

Bohachevsky 

1 

.0746 

.0249 

.0035 

.0853 

.0370 

.0013 

[-100,100]^ 

10 

.3585 

2.5349 

.2876 

.5611 

2.7056 

.2993 


20 

1.8224 

3.230 

1.5578 

1.9668 

3.696 

1.7008 

Six-hump Camel 

1 

.0023 

.0019 

.0000 

.0117 

.8097 

.0000 

[-3,3] X [-2,2] 

10 

.0895 

.6332 

.0000 

.1293 

.6098 

.0000 


20 

.4922 

.2159 

.0215 

.6183 

.2696 

0.0306 


Table 1: Quantitative comparison for KGSpLin and KGLin on standard test functions. Each 
row summarizes 500 runs of each policy on the specified test function. We compute the mean, 
standard deviation and median of OG. Each function is scaled to have a range of 100 and 
the results are given for different levels of noise standard deviation. 

the relevant component functions are given by 


fl2{xi,X2) 

= + ^ + X 1 X 2 +xl, 

0 

(29) 

h{x) 

= 2sin(27rx), 

(30) 

fi{x) 

= 8(x-0.5)^ 

(31) 

h{x) 

= 2exp(—3x), 

(32) 


where the first component function /12 in (|29|) is known as the Three-hump camel function. 
We plot the true Three-hump camel function in Figure [3l^a), while the key part is shown 
in Figure [3l^b). For / 12 , we use B-splines tensor product space ^4 ( 8 * ^4 to approximate it. 
The knot sequence is equally spaced on [—5, 5]^ with X = 4 (the number of subintervals for 
each dimension is ii' -|- 1 = 5). The remaining three relevant components are approximated 
using B-splines with order I = 4 and equally spaced knot sequence on [0,1] with X = 4. 
The alternatives are uniformly sampled on the domain with M = 400 and the measurement 
budget N is 30. The standard deviation of measurement noise is set to 20% of the expected 
range of the truth. 

Then we run the KGSpAM policy on a p = 100-dimensional space. To better visualize 
its performance, we plot the starting prior and estimated function of negative /12 on its 
key region after the initial 10 and 30 observations as shown in Figure 01 Comparing these 
estimates with the true function shown in Figure 01 we visually see that the policy has done 
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Three-Hump Camel Function Three-Hump Camel Function 



(a) (b) 


Figure 3: (a) shows the negative Three-hump camel function on its recommended input 
domain, and (b) shows only a portion of this domain, to allow for easier viewing of the 
function’s key characteristics. The function has one global maximum and two other local 
maxima. 


a good job estimating the lower key regions of the functions as desired after 10 observations 
and it identifies the areas of the three maxima after 30 observations. For the remaining 
three relevant functional components in (1301) . (I3T|) and (l32l) . we plot the prior, truth and 
final estimates of KGLin and KGSpAM in Figure [5j 



Figure 4: (a) shows the prior of negative Three-hump camel function on its key region, (b) 
and (c) show the estimates of negative Three-hump camel function on its key region after 
10 and 30 observations respectively. 


8 Conclusion 

In this paper, we extend the KG policy to high dimensional linear and nonparamet- 
ric additive beliefs. It is a novel hybrid of Bayesian R&S with the frequentist learning 
approach. Parallel with the Bayesian model, the policies use frequentist recursive Lasso ap¬ 
proach to generate estimates and update the Bayesian model. Empirically, both KGSpLin 
and KGSpAM greatly reduce the measurement budget effort and perform significantly better 
than several other policies in high dimensional setting. In addition, these policies are easy 
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Figure 5: (a)(b)(c) The prior, truth and final estimate of sparse additive model in (I30|) - (f32]l 
comparing KGLin and KGSpAM after N = 30 observations. The standard deviation of 
measurement noise is 1, which is about 20% of the expected range of the truth. 


to implement and fast to compute. Theoretically, we prove that our policies are consistent. 
That is, the estimates can converge to the truth when given enough measurements. This 
also guarantees the convergence to global optimal alternative. All these advantages make 
them reasonable alternatives to other policies for high dimensional applications with sparse 
structure. Despite the advances, the convergence theory requires a number of structural 
assumptions, suggesting that future research should look to identify algorithms that work 
with more general model structures in high dimensions. 


Appendix A. 

Refer to Table [2j 


Appendix B. Proofs 

In the following, we present the detailed proofs of all the technical results. 


B.l Proof of Proposition [2] 

Let us define be the sample covariance matrix, that is = 

For any N' < n' < cN', let us divide the design matrix 


Xn'-i 


XiV'-i 

X+ 


We need to prove X”^ ^ satisfies condition SRG (r, c*/c, max(c*, R)). Note that ^ 

satisfies SRG (r, c*,c*) is equivalent to 

c* < Amin(S^’'^ < Ainax(T^’'^ < c*, V5 with r = |5| and u E 
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^jrkq 
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X«-i 
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Description 

Set of alternatives 

Number of alternatives 

Number of measurements budget 

Unknown mean of alternative x 

Known standard deviation of alternative x 

Column vector (/ri,..., /tm)^ 

Sampling decision at time i (vector or scalar index) 

Measurement error of alternative x" 

Sampling observation from measuring alternative x" 

Mean and Covariance of prior distribution on /r at time n 
State variable, defined as the pair (0", S") 

Knowledge gradient value for alternative x at time n 
Vector of linear coefficients 
Number of features 
Alternative matrix 

Mean of covariance of posterior distribution on a after n measurements 
Number of nonoverlapping groups for features 
Group index 

Number of features in the jith group,dj = \Qj\ 

Prior of ^ at time n 

Parameter of Bernoulli distribution on (" 

Set of parameters of Beta distribution on p" 

Lasso estimate at time n 

Mean and covariance matrix estimator from Lasso solution at time n 
Index of selected groups from Lasso estimate at time n 
Active group index set 
Inactive group index set 

Index set in the jith group with maximum absolute values 

Index set in the jth group except for Aj 

Smooth function of the jth feature 

Number of interior knots for one dimensional splines 

Space of polynomial spline of order Ij 

fc-th B-spline basis function for Si^ 

Coefficient for fj on basis function (f>jk 

Two-factor interaction component in SS-ANOVA model 

rg-th B-spline basis function for Si^ (8) 

Maximum group size 

Design matrix with rows of a;, a;"“^ 

Smoothness parameter of the Holder class T-L 
Cardinality of the true grouB^^sct, s* = |<S*| 


Table 2: Table of Notation 




Then we have that for V5 with r = |5| 


.X,n'-1 _ {-^*S ) -^*S _ y-^*S ) ^*S + 1^*5i 


n' 


n' 


+ (KtsfX 


*<s 


n' 


This implies that 


and 


Since 


and 


n' c 


■.X,n'-lx ^ A /A^X,Ar'-i, 1 


An.ax(S^’" ■^) < —A^ax(S:^"^ ■^) + -Ara^UXt^Y X+ ]. 


{Xtsf^ts = i^S f + + • • • + "'K 


( 33 ) 


AmaxI^SC^s)"^] = ll^slli < B, Vn, 


we can get that 


Amax(Sc’ ) < —rC H- - — B<max[c,B). 

n' n' 


Combining ([551) and ([5^ completes the proof. 


B.2 Proof of Theorem [T] 


(34) 


The proof of part (1) directly follows Assumption [5l Proposition [2] and Lemma m We 
now proceed to prove part (2). If we let S := , then from updating formula in 

([TH]) and (fT^ . we have 




-,i 9 ,Ar'-lN-l nW'-l 






Then if we define 

:= 

:= 

for all — 1 < n' < n to simplify notation, we have 




tS,-!! 
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This gives us the following bound on S'g, 


< ||SI’ 112 


._ii usN' 


i2 + iii(s|„';)-‘i5ibii?|'ii2+ 


■■■+lll(SsiT‘lslbllS3ll2 


We now proceed to bound each of the quantities. Let us for now assume that N' <n' < n. As 
we suppose the design matrix for Lasso solution satisfies Assumption [3 by Proposition 
[2] and Lemma [H if we choose \^' such that 


A"" = 0 {d\/n' logp), 

then there exists some constant Cq such that 


II 2 < CsMa/ for all N' <n' < n, 


n 


(35) 


(36) 


with probability converging to 1. We know from (j2^ that 


si'' = mItV^ + (A-yMl7iCov(z.2,0^"'^Ml7\ 


where 


M"'7^ = 

gn' 


Assumption [5] gives us 


/W- 

yx"'-) 

-1 



iv'-l', 

5 ! 

* 

VI 

< 00 

f-l) 

1 

> 

V 

0 

*S / 

N'c* 


Amin(M 

for any S with |5| = C 3 S*. Therefore, since 15*^^ | < 6 * 35 *, by Proposition [21 we can show 
that for all N' <n' < n, there exist positive constants and (Pg, such that 

^7 


A„ia.(My) < ^< 00 , 

— 1 ^ 

~ n' 


A^i,(Mi7') > ^>0. 


(37) 

(38) 


It is not hard to prove 


Amin(MN) > Ai„in(M)Amin(N) 


for any positive semidefinite matrices M and N, so using Weyl’s inequality in matrix theory, 
and (l38l) . we have the following bound. 




b-ii2 < Ii(sy7 
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I 2 = ) 


+ (A“')'A„i„(Cov(zA,))Ai|JM"'.7') 
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Cgn' 
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for some constant Cg. Similarly, by (1351) . ([37|1 . and (l23ll . we can also get 


\^Sn' II 2 ~ ) 


< cT,^A„Aax(M” J^) + (A- )^A^ax(Cov(^p ))ALx(M^:7^) 


n' 


for some constant Cio- Thus, for the posterior covariance matrix, we have 


IS5 II2 - 
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15 
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A-i -I_ 

2Cigal(P\ogp 
{N' + n){n - N' + 1) 
CiialcPlogp 
n2 


for some constant Cn. If we let 

then combining p6]l . (l39]l and (Hn|l gives us the following bound on (5^ 
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CGCgVs*n' \ 

,d^/logp j 


^ Ci 2 credy/s*\ogp ^ Ciio-^cPlogpAg(A^O 
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for some constant C 12 , which is equivalent to 




(40) 


(41) 


n 


and thus completes the proof. 


B.3 Proof of Theorem [2] 

By definition of /j, 1 < j < p, part (1) follows from part (2) of Theorem [1] directly. Now 
consider part (2). We denote fj as 

^ dj 

= '^'^iki’jk{x), for l<j<p. 
k=l 

We also have 

dj 

= Y1 '^Ik'^jkix), for l<j<p. 
k=l 
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Since ifjjk is the orthonormal basis, we have 


Also by Assumption [6] and Lemma 8 in 


Stone 


198fil |. taking q = 2, we have 


-fjf = 0{df'^) = 0{d-^) 


Thus by the result of Theorem [U we have 

II ^ ||2 / Cia^s*(Plogp , Ci3 

\Us-M < ^ +^- 

Note that choosing d = and s* = 0(1) would not change the rate in equation (HU, 

so we have the following bound 


ll/|-/5lli< 


Oscr^logp 


n 
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